Yousef Masluf

##   fixed.acidity volatile.acidity citric.acid residual.sugar chlorides
## 1           7.4             0.70        0.00            1.9     0.076
## 2           7.8             0.88        0.00            2.6     0.098
## 3           7.8             0.76        0.04            2.3     0.092
## 4          11.2             0.28        0.56            1.9     0.075
## 5           7.4             0.70        0.00            1.9     0.076
## 6           7.4             0.66        0.00            1.8     0.075
##   free.sulfur.dioxide total.sulfur.dioxide density   pH sulphates alcohol
## 1                  11                   34  0.9978 3.51      0.56     9.4
## 2                  25                   67  0.9968 3.20      0.68     9.8
## 3                  15                   54  0.9970 3.26      0.65     9.8
## 4                  17                   60  0.9980 3.16      0.58     9.8
## 5                  11                   34  0.9978 3.51      0.56     9.4
## 6                  13                   40  0.9978 3.51      0.56     9.4
##   quality
## 1       5
## 2       5
## 3       5
## 4       6
## 5       5
## 6       5

This project uses exploratory data analysis (EDA) techniques to explore relationships ranging from a single variable to multiple variables. I selected the red wine dataset to explore visualizations, distributions, outliers, and anomalies.

The main question is “Which chemical properties influence the quality of red wines?” My main goal is therefore to find out which chemical properties influence the quality of red wines, applying EDA techniques in the R programming language.

Univariate Plots Section

## 'data.frame':    1599 obs. of  12 variables:
##  $ fixed.acidity       : num  7.4 7.8 7.8 11.2 7.4 7.4 7.9 7.3 7.8 7.5 ...
##  $ volatile.acidity    : num  0.7 0.88 0.76 0.28 0.7 0.66 0.6 0.65 0.58 0.5 ...
##  $ citric.acid         : num  0 0 0.04 0.56 0 0 0.06 0 0.02 0.36 ...
##  $ residual.sugar      : num  1.9 2.6 2.3 1.9 1.9 1.8 1.6 1.2 2 6.1 ...
##  $ chlorides           : num  0.076 0.098 0.092 0.075 0.076 0.075 0.069 0.065 0.073 0.071 ...
##  $ free.sulfur.dioxide : num  11 25 15 17 11 13 15 15 9 17 ...
##  $ total.sulfur.dioxide: num  34 67 54 60 34 40 59 21 18 102 ...
##  $ density             : num  0.998 0.997 0.997 0.998 0.998 ...
##  $ pH                  : num  3.51 3.2 3.26 3.16 3.51 3.51 3.3 3.39 3.36 3.35 ...
##  $ sulphates           : num  0.56 0.68 0.65 0.58 0.56 0.56 0.46 0.47 0.57 0.8 ...
##  $ alcohol             : num  9.4 9.8 9.8 9.8 9.4 9.4 9.4 10 9.5 10.5 ...
##  $ quality             : int  5 5 5 6 5 5 5 7 7 5 ...
## 
##   3   4   5   6   7   8 
##  10  53 681 638 199  18
##     Bad Average    Good perfect 
##      63     681     638     217

The above result shows that the majority of the wines were rated between 5 and 7 (1,518 of 1,599).
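
To put numbers on that reading, the share of wines at each score can be computed from the counts printed above. This is only a minimal sketch: the names `quality_counts`, `quality_share`, and `share_5_to_7` are my own, and the counts are copied by hand from the table rather than read from the data file.

```r
# Counts per quality score, copied from the table() output above
quality_counts <- c("3" = 10, "4" = 53, "5" = 681, "6" = 638, "7" = 199, "8" = 18)

# Percentage of wines at each score
quality_share <- round(100 * quality_counts / sum(quality_counts), 1)
print(quality_share)

# Share of wines rated 5, 6, or 7
share_5_to_7 <- sum(quality_counts[c("5", "6", "7")]) / sum(quality_counts)
print(round(100 * share_5_to_7, 1))
```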

This graph shows that the minimum fixed acidity is 4.60 and the maximum is 15.90. The distribution is right skewed.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.1200  0.3900  0.5200  0.5278  0.6400  1.5800

This distribution is approximately normal, with a mean of about 0.53, a minimum volatile acidity of 0.12, and a maximum of 1.58.
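
A rough way to judge the shape without the plot is to compare the mean with the median and the two tails with each other. The sketch below does this using only the values printed in the summary above; the vector `va` and the other names are hypothetical helpers, not part of the original analysis.

```r
# Five-number summary plus mean, copied from the summary() output above
va <- c(Min = 0.12, Q1 = 0.39, Median = 0.52, Mean = 0.5278, Q3 = 0.64, Max = 1.58)

# Mean minus median: a positive gap hints at a longer right tail
mean_median_gap <- va[["Mean"]] - va[["Median"]]

# Length of each tail outside the middle 50% of the data
lower_tail <- va[["Q1"]] - va[["Min"]]
upper_tail <- va[["Max"]] - va[["Q3"]]

print(c(gap = mean_median_gap, lower = lower_tail, upper = upper_tail))
```

The mean sits slightly above the median and the upper tail is longer than the lower one, so the right side is somewhat stretched even though the middle of the distribution looks bell shaped.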

This graph shows that the minimum free sulfur dioxide is 1, the maximum is 72, and the mean is about 15. The distribution is right skewed. The majority of free.sulfur.dioxide values lie between 1 and 40.

Density is normally distributed. The graph is bell shaped.

pH is normally distributed. The graph is bell shaped. Most of the data gathers around the mean (about 3.3).

Univariate Analysis

There are 1599 observations in this dataset with 12 variables. Density and pH are normally distributed. Free sulfur dioxide is right skewed: its mean is about 15, the minimum is 1, and the maximum is 72, which demonstrates a long tail. In addition, fixed and volatile acidity, sulfur dioxides, sulphates, and alcohol also seem to be long-tailed.

Our main focus here is the quality of the wine. Apart from that, we will also look at how the other variables affect quality. The basic characteristics of a wine are sweetness, acidity, tannin, fruit, and alcohol content. While our dataset does not have all of these features, I will try to look into other features that might be important in the process of rating a wine.

As mentioned above, I want to explore major features such as alcohol content, acidity, pH, and sugars, and then investigate how they affect the quality of the wine.

The histograms reveal the following observations:
- Density and pH are normally distributed.
- Fixed and volatile acidity, free and total sulfur dioxide, sulphates, and alcohol are long-tailed.

I created a new variable named quality_rank to rank the quality of the wine on four levels: bad, average, good, and perfect.
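
One way such a ranking could be built is with `cut()`. The cut points below are an assumption inferred from the counts printed earlier (63 = scores 3 and 4, 681 = score 5, 638 = score 6, 217 = scores 7 and 8), not the exact code used for this report, and the input is only the first ten quality scores shown in the `str()` output.

```r
# Quality scores of the first ten wines, copied from the str() output above
quality <- c(5L, 5L, 5L, 6L, 5L, 5L, 5L, 7L, 7L, 5L)

# Assumed cut points: 3-4 -> bad, 5 -> average, 6 -> good, 7-8 -> perfect
# (cut() uses right-closed intervals by default, e.g. (4, 5] holds score 5)
quality_rank <- cut(quality,
                    breaks = c(2, 4, 5, 6, 8),
                    labels = c("bad", "average", "good", "perfect"))

print(table(quality_rank))
```

Turning the score into an ordered set of levels like this makes it easy to use as a colour or grouping variable in later plots.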

Bivariate Plots Section

Bivariate Analysis

Stacked histograms of the variables with respect to quality did not reveal much information, except that alcohol strongly affects wine quality. The correlation scatterplots showed a strong positive correlation between alcohol and quality and a strong negative correlation between volatile acidity and quality. That leads to a general observation: good wines contain higher alcohol content, higher citric acid, and lower volatile acidity. There is also a strong positive relationship between density and fixed.acidity. On the other hand, pH has a negative relationship with fixed acidity.

I observed interesting relationships between the other features, such as:
- Fixed acidity vs citric acid: 0.67
- Fixed acidity vs density: 0.67
- Fixed acidity vs pH: -0.68
- Volatile acidity vs citric acid: -0.55

The relationship between density and fixed acidity was one of the strongest (0.67). Citric acid and fixed acidity also showed a strong positive correlation (0.67), whereas pH and fixed acidity showed a strong negative correlation (-0.68).
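
Correlations like these can be obtained with `cor()`. The sketch below is only an illustration on the first ten rows shown in the `str()` output, stored in a hypothetical data frame `wine_head`, so its numbers will not match the full-data values quoted above.

```r
# First ten rows of four columns, copied from the str() output above
wine_head <- data.frame(
  fixed.acidity    = c(7.4, 7.8, 7.8, 11.2, 7.4, 7.4, 7.9, 7.3, 7.8, 7.5),
  volatile.acidity = c(0.70, 0.88, 0.76, 0.28, 0.70, 0.66, 0.60, 0.65, 0.58, 0.50),
  citric.acid      = c(0.00, 0.00, 0.04, 0.56, 0.00, 0.00, 0.06, 0.00, 0.02, 0.36),
  pH               = c(3.51, 3.20, 3.26, 3.16, 3.51, 3.51, 3.30, 3.39, 3.36, 3.35)
)

# Pearson correlation matrix, rounded for readability
cor_matrix <- round(cor(wine_head, method = "pearson"), 2)
print(cor_matrix)
```

On the full dataset the same call would be made on all 1599 rows instead of this ten-row excerpt.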


Multivariate Plots Section

In the above plots, darker points indicate better quality wines. High citric acid and low acetic acid (volatile acidity) seem to be a good combination for a quality wine.

Multivariate Analysis

After exploring many multivariate plots, I can say that a high quality wine can come from combinations such as:
- High alcohol content and high sulphate level
- High alcohol content and low volatile acidity

While looking for interesting multivariate plots, I created three plots. In the volatile acidity vs alcohol plot, I added quality_rank as the color, and a very interesting plot emerged. There were clusters in the plot: high quality wines had low volatile acidity and high alcohol values, whereas mid and low quality wines had higher volatile acidity and lower alcohol values. It was a big surprise for me, since I did not expect such a plot.
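
A plot of this kind can be sketched in base R by mapping each rank level to a colour. Everything below is illustrative: the data is only the first ten rows from the `str()` output, the rank cut points are assumed from the counts printed earlier, and the palette `rank_colours` is hypothetical. The original plots may well have been drawn with a different plotting library.

```r
# First ten rows, copied from the str() output above
wine_head <- data.frame(
  alcohol          = c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10.0, 9.5, 10.5),
  volatile.acidity = c(0.70, 0.88, 0.76, 0.28, 0.70, 0.66, 0.60, 0.65, 0.58, 0.50),
  quality          = c(5L, 5L, 5L, 6L, 5L, 5L, 5L, 7L, 7L, 5L)
)

# Assumed ranking: 3-4 bad, 5 average, 6 good, 7-8 perfect
wine_head$quality_rank <- cut(wine_head$quality, breaks = c(2, 4, 5, 6, 8),
                              labels = c("bad", "average", "good", "perfect"))

# One colour per rank level (hypothetical palette)
rank_colours <- c(bad = "red", average = "orange",
                  good = "steelblue", perfect = "darkgreen")
point_colours <- rank_colours[as.character(wine_head$quality_rank)]

# Scatterplot of volatile acidity against alcohol, coloured by rank
plot(wine_head$alcohol, wine_head$volatile.acidity,
     col = point_colours, pch = 19,
     xlab = "alcohol (% by volume)", ylab = "volatile acidity")
legend("topright", legend = names(rank_colours), col = rank_colours, pch = 19)
```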


Final Plots and Summary

Description One

Some wines contain a higher alcohol percentage by volume than others. However, we should keep looking at acidity: wines with an alcohol percentage above 10% tend to have a low amount of acetic acid.

Description Two

Quality rank depends on the sulphates (potassium sulphate) value, which is a somewhat unexpected result. If we look at sulphates vs quality level, we find that higher quality wines contain more sulphates. Therefore, the higher the sulphates, the higher the quality of the wine, which points to a positive correlation between sulphates and quality.

Description Three

From the plot of alcohol against quality rank, we can see that although the data is scattered, there is a pattern: the higher the alcohol percentage, the higher the quality of the red wine.
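
A simple numeric companion to this plot is the median alcohol content per quality score, which `tapply()` can compute. The sketch below only demonstrates the mechanics on the first ten rows shown in the `str()` output. With so few rows it cannot confirm or refute the pattern, and the name `median_alcohol` is my own.

```r
# Alcohol and quality for the first ten wines, copied from the str() output above
alcohol <- c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10.0, 9.5, 10.5)
quality <- c(5L, 5L, 5L, 6L, 5L, 5L, 5L, 7L, 7L, 5L)

# Median alcohol content within each quality score
median_alcohol <- tapply(alcohol, quality, median)
print(median_alcohol)
```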


Summary

The three final plots show wine quality in terms of several characteristics. As a result, wines at the same quality level might have different proportions of sulphates, acidity, and alcohol.

Reflection

It was interesting to explore this data set. Since I don’t drink, I found it fascinating to determine which characteristics make a wine taste good.

During my analysis I faced some difficulties, mainly with some R functions, which I resolved using the manual, blogs, books, etc.

Possible future research: I am going to continue exploring the dataset and apply different methods of analysis, studying different types of wine and the characteristics that create a special taste for any individual.